AITopics | capability evaluation

Collaborating Authors

capability evaluation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports

McCaslin, Tegan, Alaga, Jide, Nedungadi, Samira, Donoughe, Seth, Reed, Tom, Bommasani, Rishi, Painter, Chris, Righetti, Luca

arXiv.org Artificial IntelligenceSep-4-2025

Evaluations of dangerous AI capabilities are important for managing catastrophic risks. Public transparency into these evaluations - including what they test, how they are conducted, and how their results inform decisions - is crucial for building trust in AI development. We propose STREAM (A Standard for Transparently Reporting Evaluations in AI Model Reports), a standard to improve how model reports disclose evaluation results, initially focusing on chemical and biological (ChemBio) benchmarks. Developed in consultation with 23 experts across government, civil society, academia, and frontier AI companies, this standard is designed to (1) be a practical resource to help AI developers present evaluation results more clearly, and (2) help third parties identify whether model reports provide sufficient detail to assess the rigor of the ChemBio evaluations. We concretely demonstrate our proposed best practices with "gold standard" examples, and also provide a three-page reporting template to enable AI developers to implement our recommendations more easily.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2508.09853

Country:

North America > United States (1.00)
Europe (0.67)

Genre:

Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (0.92)
Research Report > Strength High (0.92)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods

Grey, Markov, Segerie, Charbel-Raphaël

arXiv.org Artificial IntelligenceMay-12-2025

As frontier AI systems advance toward transformative capabilities, we need a parallel transformation in how we measure and evaluate these systems to ensure safety and inform governance. While benchmarks have been the primary method for estimating model capabilities, they often fail to establish true upper bounds or predict deployment behavior. This literature review consolidates the rapidly evolving field of AI safety evaluations, proposing a systematic taxonomy around three dimensions: what properties we measure, how we measure them, and how these measurements integrate into frameworks. We show how evaluations go beyond benchmarks by measuring what models can do when pushed to the limit (capabilities), the behavioral tendencies exhibited by default (propensities), and whether our safety measures remain effective even when faced with subversive adversarial AI (control). These properties are measured through behavioral techniques like scaffolding, red teaming and supervised fine-tuning, alongside internal techniques such as representation analysis and mechanistic interpretability. We provide deeper explanations of some safety-critical capabilities like cybersecurity exploitation, deception, autonomous replication, and situational awareness, alongside concerning propensities like power-seeking and scheming. The review explores how these evaluation methods integrate into governance frameworks to translate results into concrete development decisions. We also highlight challenges to safety evaluations - proving absence of capabilities, potential model sandbagging, and incentives for "safetywashing" - while identifying promising research directions. By synthesizing scattered resources, this literature review aims to provide a central reference point for understanding AI safety evaluations.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2505.05541

Country: North America > United States (0.27)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Add feedback

MiningGPT -- A Domain-Specific Large Language Model for the Mining Industry

Demartini, Kurukulasooriya Fernando ana Gianluca

arXiv.org Artificial IntelligenceDec-2-2024

Recent advancements of generative LLMs (Large Language Models) have exhibited human-like language capabilities but have shown a lack of domain-specific understanding. Therefore, the research community has started the development of domain-specific LLMs for many domains. In this work we focus on discussing how to build mining domain-specific LLMs, as the global mining industry contributes significantly to the worldwide economy. We report on MiningGPT, a mining domain-specific instruction-following 7B parameter LLM model which showed a 14\% higher mining domain knowledge test score as compared to its parent model Mistral 7B instruct.

dataset, mininggpt, miningpile, (16 more...)

arXiv.org Artificial Intelligence

2412.01189

Country:

Oceania > Australia > Queensland (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.94)

Industry:

Materials > Metals & Mining (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Towards evaluations-based safety cases for AI scheming

Balesni, Mikita, Hobbhahn, Marius, Lindner, David, Meinke, Alexander, Korbak, Tomek, Clymer, Joshua, Shlegeris, Buck, Scheurer, Jérémy, Stix, Charlotte, Shah, Rusheb, Goldowsky-Dill, Nicholas, Braun, Dan, Chughtai, Bilal, Evans, Owain, Kokotajlo, Daniel, Bushnaq, Lucius

arXiv.org Artificial IntelligenceNov-7-2024

We sketch how developers of frontier AI systems could construct a structured rationale -- a 'safety case' -- that an AI system is unlikely to cause catastrophic outcomes through scheming. Scheming is a potential threat model where AI systems could pursue misaligned goals covertly, hiding their true capabilities and objectives. In this report, we propose three arguments that safety cases could use in relation to scheming. For each argument we sketch how evidence could be gathered from empirical evaluations, and what assumptions would need to be met to provide strong assurance. First, developers of frontier AI systems could argue that AI systems are not capable of scheming (Scheming Inability). Second, one could argue that AI systems are not capable of posing harm through scheming (Harm Inability). Third, one could argue that control measures around the AI systems would prevent unacceptable outcomes even if the AI systems intentionally attempted to subvert them (Harm Control). Additionally, we discuss how safety cases might be supported by evidence that an AI system is reasonably aligned with its developers (Alignment). Finally, we point out that many of the assumptions required to make these safety arguments have not been confidently satisfied to date and require making progress on multiple open research problems.

ai system, evaluation, safety case, (16 more...)

arXiv.org Artificial Intelligence

2411.03336

Country:

North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
Pacific Ocean > North Pacific Ocean > San Francisco Bay > Golden Gate (0.04)
Oceania > Papua New Guinea > Gulf Province > Kerema (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (1.00)
Government > Regional Government > North America Government > United States Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

van der Weij, Teun, Hofstätter, Felix, Jaffe, Ollie, Brown, Samuel F., Ward, Francis Rhys

arXiv.org Artificial IntelligenceJun-14-2024

Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI's actual capability. These conflicting interests lead to the problem of sandbagging $\unicode{x2013}$ which we define as "strategic underperformance on an evaluation". In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted, or password-locked, to target specific scores on a capability evaluation. Even more, we found that a capable password-locked model (Llama 3 70b) is reasonably able to emulate a less capable model (Llama 2 7b). Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems.

ai system, capability evaluation, evaluation, (15 more...)

arXiv.org Artificial Intelligence

2406.07358

Country:

North America > United States (0.46)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > New Finding (0.86)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping

Xia, Boming, Lu, Qinghua, Zhu, Liming, Xing, Zhenchang

arXiv.org Artificial IntelligenceMay-15-2024

The advent of advanced AI underscores the urgent need for comprehensive safety evaluations, necessitating collaboration across communities (i.e., AI, software engineering, and governance). However, divergent practices and terminologies across these communities, combined with the complexity of AI systems-of which models are only a part-and environmental affordances (e.g., access to tools), obstruct effective communication and comprehensive evaluation. This paper proposes a framework for AI system evaluation comprising three components: 1) harmonised terminology to facilitate communication across communities involved in AI safety evaluation; 2) a taxonomy identifying essential elements for AI system evaluation; 3) a mapping between AI lifecycle, stakeholders, and requisite evaluations for accountable AI supply chain. This framework catalyses a deeper discourse on AI system evaluation beyond model-centric approaches.

ai model, ai system, evaluation, (10 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3664646.3664766

2404.05388

Country:

North America > United States (0.28)
Oceania > Australia > New South Wales > Sydney (0.05)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)
Africa > Eswatini > Manzini > Manzini (0.04)

Genre: Research Report (0.40)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)

Add feedback